TELECOM CUSTOMER CHURN PREDICTION

Introduction

Customer Churn refers to the phenomenon where customers stop using a company's products or services.
In the telecom industry, churn is a major concern because retaining existing customers is significantly more cost-effective than acquiring new ones.

Telecom companies generate massive amounts of data, including customer demographics, account information, service usage patterns, and contract details.
By analyzing this data, we can build predictive models to identify customers who are likely to churn, allowing companies to take proactive measures to retain them.

In this project, we aim to:

  • Understand the key factors influencing customer churn
  • Explore the dataset using Exploratory Data Analysis (EDA)
  • Apply Machine Learning algorithms to predict churn
  • Evaluate model performance and provide actionable insights for customer retention

The main objectives of this project are:

  1. Understand Customer Behavior
    Analyze customer demographics, account details, and service usage to identify patterns that influence churn.

  2. Identify Key Churn Factors
    Determine which factors most strongly contribute to customer churn, such as contract type, payment method, or usage frequency.

  3. Predict Customer Churn
    Build and evaluate machine learning models (e.g., Random Forest, Logistic Regression, SVM) to predict which customers are likely to leave.

  4. Support Decision-Making
    Provide actionable insights that help telecom companies implement strategies to retain customers and reduce churn.

  5. Evaluate Model Performance
    Measure accuracy, precision, recall, and other metrics to select the most effective predictive model.

Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import warnings
warnings.filterwarnings('ignore')
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

Load the dataset

In [2]:
df=pd.read_csv("chrun.csv")
print(df)
      customerID  gender  SeniorCitizen Partner Dependents  tenure  \
0     7590-VHVEG  Female              0     Yes         No       1   
1     5575-GNVDE    Male              0      No         No      34   
2     3668-QPYBK    Male              0      No         No       2   
3     7795-CFOCW    Male              0      No         No      45   
4     9237-HQITU  Female              0      No         No       2   
...          ...     ...            ...     ...        ...     ...   
7038  6840-RESVB    Male              0     Yes        Yes      24   
7039  2234-XADUH  Female              0     Yes        Yes      72   
7040  4801-JZAZL  Female              0     Yes        Yes      11   
7041  8361-LTMKD    Male              1     Yes         No       4   
7042  3186-AJIEK    Male              0      No         No      66   

     PhoneService     MultipleLines InternetService OnlineSecurity  ...  \
0              No  No phone service             DSL             No  ...   
1             Yes                No             DSL            Yes  ...   
2             Yes                No             DSL            Yes  ...   
3              No  No phone service             DSL            Yes  ...   
4             Yes                No     Fiber optic             No  ...   
...           ...               ...             ...            ...  ...   
7038          Yes               Yes             DSL            Yes  ...   
7039          Yes               Yes     Fiber optic             No  ...   
7040           No  No phone service             DSL            Yes  ...   
7041          Yes               Yes     Fiber optic             No  ...   
7042          Yes                No     Fiber optic            Yes  ...   

     DeviceProtection TechSupport StreamingTV StreamingMovies        Contract  \
0                  No          No          No              No  Month-to-month   
1                 Yes          No          No              No        One year   
2                  No          No          No              No  Month-to-month   
3                 Yes         Yes          No              No        One year   
4                  No          No          No              No  Month-to-month   
...               ...         ...         ...             ...             ...   
7038              Yes         Yes         Yes             Yes        One year   
7039              Yes          No         Yes             Yes        One year   
7040               No          No          No              No  Month-to-month   
7041               No          No          No              No  Month-to-month   
7042              Yes         Yes         Yes             Yes        Two year   

     PaperlessBilling              PaymentMethod MonthlyCharges  TotalCharges  \
0                 Yes           Electronic check          29.85         29.85   
1                  No               Mailed check          56.95        1889.5   
2                 Yes               Mailed check          53.85        108.15   
3                  No  Bank transfer (automatic)          42.30       1840.75   
4                 Yes           Electronic check          70.70        151.65   
...               ...                        ...            ...           ...   
7038              Yes               Mailed check          84.80        1990.5   
7039              Yes    Credit card (automatic)         103.20        7362.9   
7040              Yes           Electronic check          29.60        346.45   
7041              Yes               Mailed check          74.40         306.6   
7042              Yes  Bank transfer (automatic)         105.65        6844.5   

     Churn  
0       No  
1       No  
2      Yes  
3       No  
4      Yes  
...    ...  
7038    No  
7039    No  
7040    No  
7041   Yes  
7042    No  

[7043 rows x 21 columns]
In [3]:
df.head(10)
Out[3]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity ... DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No ... No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes ... Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes ... No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes ... Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No ... No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
5 9305-CDSKC Female 0 No No 8 Yes Yes Fiber optic No ... Yes No Yes Yes Month-to-month Yes Electronic check 99.65 820.5 Yes
6 1452-KIOVK Male 0 No Yes 22 Yes Yes Fiber optic No ... No No Yes No Month-to-month Yes Credit card (automatic) 89.10 1949.4 No
7 6713-OKOMC Female 0 No No 10 No No phone service DSL Yes ... No No No No Month-to-month No Mailed check 29.75 301.9 No
8 7892-POOKP Female 0 Yes No 28 Yes Yes Fiber optic No ... Yes Yes Yes Yes Month-to-month Yes Electronic check 104.80 3046.05 Yes
9 6388-TABGU Male 0 No Yes 62 Yes No DSL Yes ... No No No No One year No Bank transfer (automatic) 56.15 3487.95 No

10 rows × 21 columns

The data set includes information about:

  • Churn: customers who left within the last month (the Churn column)
  • Services: what each customer has signed up for, namely phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
  • Account information: how long they have been a customer (tenure), contract, payment method, paperless billing, monthly charges, and total charges
  • Demographics: gender, senior-citizen status, and whether they have partners and dependents

In [4]:
df.shape
Out[4]:
(7043, 21)
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 
 17  PaymentMethod     7043 non-null   object 
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   object 
 20  Churn             7043 non-null   object 
dtypes: float64(1), int64(2), object(18)
memory usage: 1.1+ MB
In [6]:
df.columns.values
Out[6]:
array(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'Churn'], dtype=object)
In [7]:
df.dtypes
Out[7]:
customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

Visualize the missing values

In [8]:
msno.matrix(df)
Out[8]:
<Axes: >

Data Manipulation

In [9]:
df=df.drop(['customerID'],axis=1)
In [10]:
df
Out[10]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.5 No
2 Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.5 No
7039 Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.9 No
7040 Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.6 Yes
7042 Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.5 No

7043 rows × 20 columns

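The checks so far report no missing values, yet TotalCharges is stored as object dtype even though it holds numbers. That hints at hidden missing values encoded as blank strings. Converting the column to numeric with errors='coerce' turns those blanks into NaN so they can be counted.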
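To illustrate why blank strings escape a null check, here is a minimal sketch on a toy Series. The values are hypothetical and only mimic the shape of the TotalCharges column; the single " " entry stands in for a missing total.

```python
import pandas as pd

# Hypothetical toy column: numbers stored as text, one blank string as a hidden gap.
raw = pd.Series(["29.85", "1889.5", " ", "108.15"])

# A blank string is still a value, so the null check reports nothing missing.
hidden_before = int(raw.isnull().sum())

# errors="coerce" converts anything that cannot be parsed as a number into NaN.
clean = pd.to_numeric(raw, errors="coerce")
hidden_after = int(clean.isnull().sum())

print(hidden_before, hidden_after)
print(clean.dtype)
```

After coercion the column is numeric, and the blank entry shows up in isnull() like any other missing value.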

In [11]:
df['TotalCharges']=pd.to_numeric(df.TotalCharges,errors='coerce')
In [12]:
df.dtypes
Out[12]:
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object
In [13]:
df.isnull().sum()
Out[13]:
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
In [14]:
msno.matrix(df)
Out[14]:
<Axes: >
In [15]:
df[np.isnan(df['TotalCharges'])]
Out[15]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
488 Female 0 Yes Yes 0 No No phone service DSL Yes No Yes Yes Yes No Two year Yes Bank transfer (automatic) 52.55 NaN No
753 Male 0 No Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.25 NaN No
936 Female 0 Yes Yes 0 Yes No DSL Yes Yes Yes No Yes Yes Two year No Mailed check 80.85 NaN No
1082 Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.75 NaN No
1340 Female 0 Yes Yes 0 No No phone service DSL Yes Yes Yes Yes Yes No Two year No Credit card (automatic) 56.05 NaN No
3331 Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 19.85 NaN No
3826 Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.35 NaN No
4380 Female 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.00 NaN No
5218 Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service One year Yes Mailed check 19.70 NaN No
6670 Female 0 Yes Yes 0 Yes Yes DSL No Yes Yes Yes Yes No Two year No Mailed check 73.35 NaN No
6754 Male 0 No Yes 0 Yes Yes DSL Yes Yes No Yes No No Two year Yes Bank transfer (automatic) 61.90 NaN No

Note that tenure is 0 for all of these rows even though MonthlyCharges is populated, which suggests customers who had just signed up and had not yet been billed a total. Let's check whether tenure is 0 in any other rows.

In [16]:
df[df['tenure']==0].index
Out[16]:
Index([488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754], dtype='int64')

The rows with tenure equal to 0 are exactly the 11 rows where TotalCharges is missing; no other rows have zero tenure. Since they make up only 11 of 7,043 rows, we drop them, which should have a negligible effect on the analysis.
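The comparison of the two index lists can also be done programmatically rather than by eye. The following is a small sketch on a hypothetical mini-frame (the name toy and its values are made up for illustration), showing that dropping the zero-tenure rows also removes the missing totals when both sets of rows coincide.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame: two brand-new customers (tenure 0) with no total yet.
toy = pd.DataFrame({
    "tenure": [1, 0, 34, 0, 2],
    "MonthlyCharges": [29.85, 52.55, 56.95, 20.25, 53.85],
    "TotalCharges": [29.85, np.nan, 1889.5, np.nan, 108.15],
})

zero_tenure_idx = toy.index[toy["tenure"] == 0]
missing_total_idx = toy.index[toy["TotalCharges"].isna()]

# If both index sets match, dropping one group also removes the other.
same_rows = zero_tenure_idx.equals(missing_total_idx)

trimmed = toy.drop(index=zero_tenure_idx)
remaining_missing = int(trimmed["TotalCharges"].isna().sum())

print(same_rows, len(trimmed), remaining_missing)
```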

In [17]:
df.drop(labels=df[df['tenure']==0].index,axis=0,inplace=True)
In [18]:
df
Out[18]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.50 No
7039 Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 No
7040 Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.60 Yes
7042 Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.50 No

7032 rows × 20 columns

In [19]:
df.skew(numeric_only=True)
Out[19]:
SeniorCitizen     1.831103
tenure            0.237731
MonthlyCharges   -0.222103
TotalCharges      0.961642
dtype: float64

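Because the 11 rows with missing TotalCharges were dropped above, no NaN values should remain at this point. Mean imputation is kept as a safeguard: any NaN still present in TotalCharges would be replaced by the column mean.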
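One detail worth keeping in mind with mean imputation in pandas: fillna returns a new object and leaves the original untouched unless the result is assigned back. A minimal sketch on a hypothetical three-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing total.
toy = pd.DataFrame({"TotalCharges": [100.0, np.nan, 300.0]})

# Calling fillna without assignment only produces a filled copy.
toy.fillna(toy["TotalCharges"].mean())
remaining_before = int(toy["TotalCharges"].isna().sum())

# Assigning the result back stores the imputed value (the mean of 100 and 300).
toy["TotalCharges"] = toy["TotalCharges"].fillna(toy["TotalCharges"].mean())
remaining_after = int(toy["TotalCharges"].isna().sum())

print(remaining_before, remaining_after)
print(toy["TotalCharges"].tolist())
```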

In [20]:
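# fillna returns a copy, so assign the result back to keep the imputed values
df['TotalCharges']=df['TotalCharges'].fillna(df['TotalCharges'].mean())
df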
Out[20]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.50 No
7039 Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 No
7040 Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.60 Yes
7042 Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.50 No

7032 rows × 20 columns

In [21]:
df.isnull().sum()
Out[21]:
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
In [22]:
df['SeniorCitizen']=df['SeniorCitizen'].map({0:'No',1:'Yes'})
In [23]:
df
Out[23]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Female No Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 Male No No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 Male No No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 Male No No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 Female No No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 Male No Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.50 No
7039 Female No Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 No
7040 Female No Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 Male Yes Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.60 Yes
7042 Male No No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.50 No

7032 rows × 20 columns

In [24]:
df.nunique()
Out[24]:
gender                 2
SeniorCitizen          2
Partner                2
Dependents             2
tenure                72
PhoneService           2
MultipleLines          3
InternetService        3
OnlineSecurity         3
OnlineBackup           3
DeviceProtection       3
TechSupport            3
StreamingTV            3
StreamingMovies        3
Contract               3
PaperlessBilling       2
PaymentMethod          4
MonthlyCharges      1584
TotalCharges        6530
Churn                  2
dtype: int64
In [25]:
df['InternetService'].describe()
Out[25]:
count            7032
unique              3
top       Fiber optic
freq             3096
Name: InternetService, dtype: object
In [26]:
df.dtypes
Out[26]:
gender               object
SeniorCitizen        object
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object
In [27]:
num=['tenure','MonthlyCharges','TotalCharges']
df[num].describe()
Out[27]:
tenure MonthlyCharges TotalCharges
count 7032.000000 7032.000000 7032.000000
mean 32.421786 64.798208 2283.300441
std 24.545260 30.085974 2266.771362
min 1.000000 18.250000 18.800000
25% 9.000000 35.587500 401.450000
50% 29.000000 70.350000 1397.475000
75% 55.000000 89.862500 3794.737500
max 72.000000 118.750000 8684.800000

Data Visualization

In [28]:
# Take labels and values from the same value_counts() result so they cannot get out of order
gender_counts=df['gender'].value_counts()
churn_counts=df['Churn'].value_counts()
fig=make_subplots(rows=1,cols=2,specs=[[{'type':'domain'},{'type':'domain'}]])
fig.add_trace(go.Pie(labels=gender_counts.index, values=gender_counts.values, name="Gender"),1,1)
fig.add_trace(go.Pie(labels=churn_counts.index, values=churn_counts.values, name='Churn'),1,2)
fig.update_traces(hole=.4,hoverinfo="label+percent+name",textfont_size=16)
fig.update_layout(
    title_text="<b>Gender and Churn Distribution</b>",
    annotations=[dict(text='Gender',x=0.19,y=0.5,font_size=16,showarrow=False),
               dict(text="Churn",x=0.8,y=0.5,font_size=16,showarrow=False)])
fig.show()

About 26.6% of customers (1,869 of 7,032) churned, so the classes are imbalanced: roughly three in four customers stayed. The gender split is nearly even (3,549 male and 3,483 female customers). These two charts show each distribution separately, so on their own they cannot reveal whether churn depends on gender; the per-gender counts that follow address that question.

In [29]:
df["Churn"][df["Churn"]=='No'].groupby(by=df['gender']).count()
Out[29]:
gender
Female    2544
Male      2619
Name: Churn, dtype: int64
In [30]:
df["Churn"][df["Churn"]=='Yes'].groupby(by=df['gender']).count()
Out[30]:
gender
Female    939
Male      930
Name: Churn, dtype: int64
In [31]:
plt.figure(figsize=(6,6))
labels=['Churn: Yes',"Churn: No"]
values=[1869,5163]
colors=["plum","wheat"]
labels_gender=["F","M","F","M"]
values_gender=[939,930,2544,2619]
colors_gender=['skyblue','lightpink','skyblue','lightpink']
explode=[0.3,0.3]
explode_gender=[0.1,0.1,0.1,0.1]
textprops={"fontsize":13}
textprops1={"fontsize":10}

plt.pie(
    values,
    labels=labels,
    autopct='%1.1f%%',
    pctdistance=1.08,
    labeldistance=0.8,
    colors=colors,
    startangle=90,
    frame=True,
    explode=explode,
    radius=10,
    textprops=textprops,
    counterclock=True
)
plt.pie(
    values_gender,
    labels=labels_gender,
    autopct='%1.1f%%',
    pctdistance=0.55,
    labeldistance=0.82,
    colors=colors_gender,
    startangle=90,
    explode=explode_gender,
    radius=7,
    textprops=textprops1,
    counterclock=True
)

centre_circle=plt.Circle((0,0),5,color='black',fc='white',linewidth=0)
fig=plt.gcf()
fig.gca().add_artist(centre_circle)

plt.title("Churn distribution w.r.t Gender: Male(M) , Female(F)",fontsize=11,y=1.1)

plt.axis('equal')
plt.tight_layout()
plt.legend(fontsize=8)
plt.show()

Male and female customers churn at nearly the same rate: 939 of 3,483 female customers (about 27.0%) and 930 of 3,549 male customers (about 26.2%). Gender therefore does not appear to be a major factor in whether a customer leaves.
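The per-gender churn rates can be worked out directly from the two groupby outputs above. This sketch simply re-enters those four counts in a small table (the names counts and churn_rate are illustrative) and divides churned customers by all customers of each gender.

```python
import pandas as pd

# Counts copied from the two groupby outputs above (Churn == 'No' and Churn == 'Yes').
counts = pd.DataFrame(
    {"No": [2544, 2619], "Yes": [939, 930]},
    index=["Female", "Male"],
)

# Churn rate per gender = churned customers / all customers of that gender.
churn_rate = counts["Yes"] / counts.sum(axis=1)

print(churn_rate.round(3))
```

The two rates differ by less than one percentage point, which is consistent with reading gender as a weak signal here.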

In [32]:
df.dtypes
Out[32]:
gender               object
SeniorCitizen        object
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object
In [33]:
fig=px.histogram(df, x='Churn', color='Contract', barmode='group', title="<b>Customer Contract Distribution</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

Month-to-month contracts show by far the most churn, while customers on one- or two-year contracts rarely leave. Customers with long-term commitments tend to stay with the provider, so offering incentives for longer contracts may help reduce churn.
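The histogram shows counts, so a churn rate per contract type has to be computed separately. One way to do that is pd.crosstab with normalize="index", sketched here on a hypothetical eight-row frame; on the notebook's data the same call would take df["Contract"] and df["Churn"].

```python
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["One year"] * 4 + ["Two year"] * 2,
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes", "No", "No"],
})

# normalize="index" divides each row by its total, giving the share of
# churned and retained customers within each contract type.
rates = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")

print(rates)
```

Each row sums to 1, so the Yes column can be read directly as the churn rate for that contract type.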

In [34]:
# unique() and value_counts() return different orders, so take labels and values from the same Series
payment_counts=df['PaymentMethod'].value_counts()
fig=go.Figure(data=[go.Pie(labels=payment_counts.index,values=payment_counts.values,hole=0.3)])
fig.update_layout(width=700, title="<b>Payment Method Distribution</b>")
fig.show()
In [35]:
fig=px.histogram(df, x='Churn', color='PaymentMethod', title="<b>Customer Payment Distribution w.r.t. Churn</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

Customers who pay by electronic check show the highest churn. Those on automatic credit card or bank transfer payments churn less. This may reflect the convenience of automated payments and fewer billing issues.

In [36]:
df['InternetService'].unique()
Out[36]:
array(['DSL', 'Fiber optic', 'No'], dtype=object)
In [37]:
df[df["gender"]=="Female"][["InternetService","Churn"]].value_counts()
Out[37]:
InternetService  Churn
DSL              No       965
Fiber optic      No       889
No               No       690
Fiber optic      Yes      664
DSL              Yes      219
No               Yes       56
Name: count, dtype: int64
In [38]:
df[df["gender"]=="Male"][["InternetService","Churn"]].value_counts()
Out[38]:
InternetService  Churn
DSL              No       992
Fiber optic      No       910
No               No       717
Fiber optic      Yes      633
DSL              Yes      240
No               Yes       57
Name: count, dtype: int64
In [39]:
fig=go.Figure()

fig.add_trace(go.Bar(
    x=[['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
      ['Female', 'Male', 'Female', 'Male']],
    y=[965,992,219,240],
    name='DSL'
))
fig.add_trace(go.Bar(
    x=[['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
      ['Female', 'Male', 'Female', 'Male']],
    y=[889,910,664,633],
    name='Fiber optic'
))
fig.add_trace(go.Bar(
    x=[['Churn:No', 'Churn:No', 'Churn:Yes', 'Churn:Yes'],
      ['Female', 'Male', 'Female', 'Male']],
    y=[690,717,56,57],
    name='No internet'
))

fig.update_layout(width=900, height=400, title="<b>Churn Distribution w.r.t. Internet Service and Gender</b>")
fig.show()

Fiber optic is the most common internet service (3,096 of 7,032 customers) and also has the highest churn rate: about 41.9% of Fiber optic customers churned, compared with about 19.0% of DSL customers and about 7.4% of customers without internet service. This may point to dissatisfaction with the Fiber optic service, although the data does not say why these customers left.
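Hard-coded bar heights are easy to mistype, so it can help to derive them from a table. The sketch below re-enters the counts from the two value_counts outputs above in long format, pivots them into the (Churn, gender) order used on the chart's x-axis, and computes a churn rate per service. The names counts, table, y_values and churn_rate are illustrative, not part of the notebook.

```python
import pandas as pd

# Counts copied from the Female and Male value_counts outputs above.
rows = [
    ("Female", "DSL", "No", 965), ("Female", "DSL", "Yes", 219),
    ("Female", "Fiber optic", "No", 889), ("Female", "Fiber optic", "Yes", 664),
    ("Female", "No", "No", 690), ("Female", "No", "Yes", 56),
    ("Male", "DSL", "No", 992), ("Male", "DSL", "Yes", 240),
    ("Male", "Fiber optic", "No", 910), ("Male", "Fiber optic", "Yes", 633),
    ("Male", "No", "No", 717), ("Male", "No", "Yes", 57),
]
counts = pd.DataFrame(rows, columns=["gender", "InternetService", "Churn", "n"])

# One row per service; columns sorted as (No, Female), (No, Male), (Yes, Female), (Yes, Male).
table = counts.pivot_table(
    index="InternetService", columns=["Churn", "gender"], values="n", aggfunc="sum"
)
y_values = {service: table.loc[service].tolist() for service in table.index}

# Churn rate per service = churned customers / all customers with that service.
totals = counts.groupby("InternetService")["n"].sum()
churned = counts[counts["Churn"] == "Yes"].groupby("InternetService")["n"].sum()
churn_rate = churned / totals

print(y_values)
print(churn_rate.round(3))
```

With the full DataFrame available, the same table could be built from a groupby on InternetService, Churn and gender instead of re-entering counts by hand.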

In [40]:
color={"Yes":"violet","No":"cyan"}
fig=px.histogram(
    df,
    x='Churn',
    color='Dependents',
    barmode='group',
    title="<b>Dependents Distribution</b>",
    color_discrete_map=color
)
fig.update_layout(
    width=700,
    height=500,
    bargap=0.1
)
fig.show()

Customers with dependents churn less often than those without, and the Partner chart that follows shows the same pattern. One possible reason is that households rely more on stable communication services. Customers without partners or dependents therefore appear more at risk of leaving.

In [41]:
color={"Yes":"hotpink","No":"lightblue"}
fig=px.histogram(
    df,
    x="Churn",
    color="Partner",
    barmode="group",
    title="<b>Churn Distribution w.r.t. Partner</b>",
    color_discrete_map=color
)
fig.update_layout(
    width=700,
    height=500,
    bargap=0.1
)
fig.show()
In [42]:
color={"Yes":"palegreen","No":"khaki"}
fig=px.histogram(
    df,
    x="Churn",
    color="SeniorCitizen",
    barmode="group",
    title="<b>Churn Distribution w.r.t. Senior Citizen</b>",
    color_discrete_map=color
)
fig.update_layout(
    width=700,
    height=500,
    bargap=0.1
)
fig.show()

Senior citizens exhibit a noticeably higher churn rate than non-senior customers. This suggests that age or technology comfort level may influence customer satisfaction. Older customers may require more personalized support to improve retention.

In [43]:
color={"Yes":"deeppink","No":"darkviolet","No internet service":"lightgreen"}
fig=px.histogram(
    df,
    x="Churn",
    color="OnlineSecurity",
    barmode="group",
    title="<b>Churn Distribution w.r.t. Online Security</b>",
    color_discrete_map=color
)
fig.update_layout(
    width=700,
    height=500,
    bargap=0.1
)
fig.show()

Customers without online security churn more often than those who have it. This suggests that value-added services such as security are associated with higher satisfaction and retention, so encouraging such subscriptions may help lower churn.

In [44]:
color={"Yes":"deeppink","No":"darkviolet","No phone service":"lightgreen"}
fig=px.histogram(
    df,
    x="Churn",
    color="MultipleLines",
    barmode="group",
    title="<b>Churn Distribution w.r.t. MultipleLines</b>",
    color_discrete_map=color
)
fig.update_layout(
    width=700,
    height=500,
    bargap=0.1
)
fig.show()

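This chart compares churn counts for customers with and without multiple lines. Counts alone do not establish that multi-line customers churn less, so the per-category churn rate should be checked before concluding that bundled or multi-service offerings increase stickiness. If that pattern holds, cross-selling more services could be an effective churn reduction strategy.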

In [45]:
color={"Yes":"black","No":"red"}
fig=px.histogram(
    df,
    x="Churn",
    color="PaperlessBilling",
    barmode="group",
    title="<b>Churn Distribution w.r.t PaperlessBilling</b>",
    color_discrete_map=color
)
fig.update_layout(
    width=700,
    height=500,
    bargap=0.1
)
fig.show()

Paperless billing users show higher churn rates than those using mailed bills. Such customers may be more digitally active and more likely to switch services. Targeted offers or engagement strategies for digital users could help retention.

In [46]:
fig=px.scatter(
    df,
    x="tenure",
    y="MonthlyCharges",
    color="Churn",
    title="<b>Tenure vs Monthly Charges By churn</b>"
)
fig.update_layout(
    width=800,
    height=600
)
fig.show()
In [47]:
fig=px.scatter(
    df,
    x="tenure",
    y="MonthlyCharges",
    size="TotalCharges",
    color="Churn",
    title="<b>Customer value By Tenure and Charges</b>"
)
fig.update_layout(
    width=800,
    height=700
)
fig.show()
In [48]:
fig=px.box(
    df,
    x="Contract",
    y="MonthlyCharges",
    color="Churn",
    title="<b>Monthly Charges By Contract type and Churn</b>"
)
fig.update_layout(
    height=600
)
fig.show()
In [49]:
fig=px.violin(
    df,
    x="InternetService",
    y="tenure",
    color="Churn",
    title="<b>Tenure Distribution by Internet Services and Churn</b>"
)
fig.update_layout(
    height=600
)
fig.show()
In [50]:
corr=df.corr(numeric_only=True)
fig=px.imshow(
    corr,
    text_auto=True,
    color_continuous_scale="PiYG",
    title="Correlation Heatmap"
)
fig.update_layout(
    width=700,
    height=500
)
fig.show()
In [51]:
sns.pairplot(df,vars=["tenure","MonthlyCharges","TotalCharges"],hue="Churn")
Out[51]:
<seaborn.axisgrid.PairGrid at 0x1d61ec22e40>

Encoding the categorical columns

In [52]:
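# Integer-encode every object (string) column, including the target Churn.
# The single encoder is refit on each column, so only the last column's
# label-to-code mapping remains available in `le` after the loop.
le=LabelEncoder()
for col in df.select_dtypes("object").columns:
    df[col]=le.fit_transform(df[col])
df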
Out[52]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 0 0 1 0 1 0 1 0 0 2 0 0 0 0 0 1 2 29.85 29.85 0
1 1 0 0 0 34 1 0 0 2 0 2 0 0 0 1 0 3 56.95 1889.50 0
2 1 0 0 0 2 1 0 0 2 2 0 0 0 0 0 1 3 53.85 108.15 1
3 1 0 0 0 45 0 1 0 2 0 2 2 0 0 1 0 0 42.30 1840.75 0
4 0 0 0 0 2 1 0 1 0 0 0 0 0 0 0 1 2 70.70 151.65 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 1 0 1 1 24 1 2 0 2 0 2 2 2 2 1 1 3 84.80 1990.50 0
7039 0 0 1 1 72 1 2 1 0 2 2 0 2 2 1 1 1 103.20 7362.90 0
7040 0 0 1 1 11 0 1 0 2 0 0 0 0 0 0 1 2 29.60 346.45 0
7041 1 1 1 0 4 1 2 1 0 0 0 0 0 0 0 1 3 74.40 306.60 1
7042 1 0 0 0 66 1 0 1 2 0 2 2 2 2 2 1 0 105.65 6844.50 0

7032 rows × 20 columns
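
The encoded table no longer shows which integer stands for which category. One way to keep that information is to store a separate encoder per column. The sketch below is an assumption-laden illustration on a hypothetical toy frame; the `encoders` dictionary is a name introduced here and is not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data for illustration only; these rows are NOT taken from the Telco dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"],
    "Churn":    ["Yes", "No", "No", "Yes"],
})

# Hypothetical helper dict: one fitted encoder per column, so every mapping survives.
encoders = {}
for col in ["Contract", "Churn"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted; the position of a label in it is its integer code.
mapping = dict(enumerate(encoders["Contract"].classes_))
print(mapping)

# Codes can be turned back into labels when interpreting results.
print(encoders["Churn"].inverse_transform([1, 0]))
```

LabelEncoder assigns codes in sorted label order, so the integers carry no meaningful ranking. Tree-based models such as the ones trained below only split on thresholds, which makes them fairly tolerant of this.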

In [53]:
x=df.drop('Churn',axis=1)
y=df['Churn']
In [54]:
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

Model Training

In [55]:
dt=DecisionTreeClassifier(criterion="entropy",random_state=42)
dt.fit(x_train,y_train)
dt_preds=dt.predict(x_test)
In [56]:
rf=RandomForestClassifier(random_state=42)
rf.fit(x_train,y_train)
rf_preds = rf.predict(x_test)
In [57]:
print("Decision Tree Accuracy:", accuracy_score(y_test, dt_preds))
print("Random Forest Accuracy:", accuracy_score(y_test, rf_preds))
Decision Tree Accuracy: 0.7448471926083866
Random Forest Accuracy: 0.7924662402274343

The Random Forest model is more accurate than the Decision Tree model on the test set (about 0.79 versus 0.74).
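
Because the classes are imbalanced, accuracy is easier to judge against a majority-class baseline, that is, a model that always predicts "no churn". The sketch below uses the test-set class counts reported in this notebook (1033 non-churn, 374 churn) and the two accuracies printed above; the variable names are introduced here for illustration.

```python
# Test-set class counts as reported in the notebook's outputs.
n_no_churn, n_churn = 1033, 374
total = n_no_churn + n_churn

# Accuracy of a trivial model that always predicts the majority class.
baseline_acc = n_no_churn / total

# Accuracies copied from the printed output above.
dt_acc = 0.7448471926083866
rf_acc = 0.7924662402274343

print(f"Majority-class baseline: {baseline_acc:.4f}")
print(f"Decision Tree gain over baseline: {dt_acc - baseline_acc:+.4f}")
print(f"Random Forest gain over baseline: {rf_acc - baseline_acc:+.4f}")
```

Both models beat the baseline, but the Decision Tree does so only narrowly, which is why the per-class metrics further down are worth reading alongside accuracy.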

In [58]:
print("DT Confusion Matrix:\n", confusion_matrix(y_test, dt_preds))
print("RF Confusion Matrix:\n", confusion_matrix(y_test, rf_preds))
DT Confusion Matrix:
 [[841 192]
 [167 207]]
RF Confusion Matrix:
 [[932 101]
 [191 183]]

Confusion Matrix

In [59]:
dt_conf=confusion_matrix(y_test, dt_preds)
sns.heatmap(
    dt_conf,
    annot=True,
    fmt='d',
    cmap='Blues',
    xticklabels=['No Churn','Churn'],
    yticklabels=['No Churn','Churn'],
)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix of Decision Tree')
plt.show()
In [60]:
rf_conf=confusion_matrix(y_test, rf_preds)
sns.heatmap(
    rf_conf,
    annot=True,
    fmt='d',
    cmap='magma',
    xticklabels=['No Churn','Churn'],
    yticklabels=['No Churn','Churn'],
)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix of Random Forest')
plt.show()

Classification Reports

In [61]:
dt_report=classification_report(y_test,dt_preds)
print(dt_report)
              precision    recall  f1-score   support

           0       0.83      0.81      0.82      1033
           1       0.52      0.55      0.54       374

    accuracy                           0.74      1407
   macro avg       0.68      0.68      0.68      1407
weighted avg       0.75      0.74      0.75      1407

In [62]:
rf_report=classification_report(y_test,rf_preds)
print(rf_report)
              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1033
           1       0.64      0.49      0.56       374

    accuracy                           0.79      1407
   macro avg       0.74      0.70      0.71      1407
weighted avg       0.78      0.79      0.78      1407
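
The churn-class figures in both reports follow directly from the confusion matrices printed earlier. As a check, the sketch below recomputes them by hand; the matrices are copied from the notebook's output, and `churn_metrics` is a hypothetical helper written for this illustration.

```python
import numpy as np

# Confusion matrices copied from the printed output (rows = actual, columns = predicted).
dt_conf = np.array([[841, 192], [167, 207]])
rf_conf = np.array([[932, 101], [191, 183]])

def churn_metrics(conf):
    """Accuracy plus precision, recall and F1 for the positive (churn) class."""
    tn, fp, fn, tp = conf.ravel()
    precision = tp / (tp + fp)   # of predicted churners, how many really churned
    recall = tp / (tp + fn)      # of actual churners, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / conf.sum()
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

dt_m = churn_metrics(dt_conf)
rf_m = churn_metrics(rf_conf)
for name, m in [("Decision Tree", dt_m), ("Random Forest", rf_m)]:
    print(name, {k: round(float(v), 2) for k, v in m.items()})
```

The Random Forest is more precise on churners but catches fewer of them (183 of 374, against 207 for the Decision Tree). If missing a churner costs more than a wasted retention offer, recall may matter more than accuracy when choosing between the two.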
